DOCUMENT RESUME

ED 258 989                                            TM 850 357

AUTHOR        Livingston, Samuel A.
TITLE         Estimating the Reliability of Classifications Based
              on Composite Scores.
PUB DATE      19 Nov 84
NOTE          15p.; Paper presented at the Annual Meeting of the
              American Educational Research Association (69th,
              Chicago, IL, March 31-April 4, 1985).
PUB TYPE      Reports - Research/Technical (143) --
              Speeches/Conference Papers (150)
EDRS PRICE    MF01/PC01 Plus Postage.
DESCRIPTORS   Equated Scores; Essay Tests; *Estimation
              (Mathematics); Mathematical Models; Scoring;
              Statistical Analysis; Test Length; Test Reliability;
              *True Scores; Weighted Scores
IDENTIFIERS   *Composite Scores; *Composite Tests



ABSTRACT 

Much previously published material for estimating the 
reliability of classification has been based on the assumption that a 
test consists of a known number of equally weighted items. The test 
score is the number of those items answered correctly. These methods
cannot be used with classifications based on weighted composite 
scores, especially if the composite includes essay scores. This paper 
presents a modification which will make it possible to apply these 
methods to composite scores. The proposed method is based on a normal 
model (with variance stabilizing transformation) for the conditional 
observed score distribution. The effective test length of the 
composite is determined from its true-score variance, estimated by
Kristof's method or by Gilmer and Feldt's method. (Author/DWH) 
Abstract

Previously published methods for estimating the reliability of classification
cannot deal with classifications based on weighted composite scores,
particularly if the composite includes essay scores. This paper presents a
method based on a normal model (with variance-stabilizing transformation)
for the conditional observed-score distribution. The effective test length
of the composite is determined from its true-score variance, estimated by
Kristof's method or by Gilmer and Feldt's method.
July 30, 1984

Estimating the Reliability of Classifications 
Based on Composite Scores 

Samuel A. Livingston 

The problem 

Several papers and articles have dealt with the problems of estimating 
the reliability of classifications based on test scores (e.g., Huynh, 1976; 
Subkoviak, 1976; Wilcox, 1981; Livingston and Wingersky, 1982). All of 
these articles are based on the assumption that the test consists of a known 
number of equally weighted items, scored simply as correct or incorrect, and 
that the test score is the number of those items answered correctly. This 
situation is certainly a common one. However, in some testing programs, 
students are classified on the basis of a composite score — a weighted sum 
of scores on two or more tests. The components may not be equally weighted. 
The component tests may include not only objective tests, but also essay 
questions. The student's score on each of the essay questions may be a 
scorer's judgment, expressed on a scale with several possible values. In
this case, determining the length of the test for the purpose of estimating 
reliability is more than a simple matter of counting test items. Can the 
methods that have been developed for estimating the reliability of
classifications be applied when the classification is based on such a composite?
The purpose of this paper is to suggest a modification that will make it 
possible to apply these methods to composite scores. 

Notation 

Let $X_c$ represent the raw composite score formed from objective component
$X_0$ and essay components $X_1$, $X_2$, etc., with weights $w_0$, $w_1$, $w_2$, etc. Let $T_c$,
$T_0$, $T_1$, $T_2$, etc. represent the corresponding true scores.
Then

$$X_c = w_0 X_0 + w_1 X_1 + w_2 X_2 + \cdots$$
$$T_c = w_0 T_0 + w_1 T_1 + w_2 T_2 + \cdots$$

Let "$X_{c\,\max}$" represent the composite score of a student who answers all the
objective items correctly and receives the highest possible score on each
essay question.

The general method 

Many different statistics have been suggested for describing the 
reliability of classifications based on test scores. These include joint 
probabilities, conditional probabilities, conditional score distributions,
and summary statistics or indices. These statistics are all ways of
summarizing the information contained in a joint distribution, and they can
be applied to either of two joint distributions: (1) the joint
distribution of true scores and observed scores (the "joint T,X 
distribution"), and (2) the joint distribution of observed scores on 
alternate forms of the test (the "joint X,X distribution"). 

If we can estimate the joint T,X distribution (true vs. observed 
scores), we can use it to estimate the joint X,X distribution (observed 
scores on alternate forms). The joint T,X distribution gives us both the 
conditional distribution of X, given T, and the marginal distribution of T
in the test-taker population. We assume that observed scores on alternate
forms are independent, for students with a given true score. This
assumption enables us to estimate the joint X,X distribution, conditional on
T. We then sum over the marginal distribution of T, to get the joint X,X 
distribution in the test-taker population. 
How, then, do we estimate the joint T,X distribution? We need a model
for the true-score distribution (of T) and a model for the conditional
observed-score distribution (of X, given T). For our true-score model we
can fit a beta distribution (Lord, 1965; Huynh, 1976; Wilcox, 1981) or use 
the observed-score distribution itself (Subkoviak, 1976), with a 
transformation to shrink the variance. But what model can we use for the 
conditional distribution of observed scores? The binomial distribution is a 
suitable model for a score that is the sum of equally weighted, 
dichotomously scored items. What kind of model can we use for a composite 
of the type described above? 

A model for the conditional observed-score distribution

One way out of this dilemma is to assume that the conditional 
observed-score distribution of the composite is similar to that of an 
all-objective test having the same reliability as the composite. The 
conditional distribution of observed scores on such a test would be 
binomial, with parameters n and p, where n is the number of items on the
test and p = T/n. This distribution could be approximated closely by a normal
distribution, if the scores were first transformed from X to
$X' = 2 \arcsin\sqrt{X/n}$. To apply this model to the composite, first express
the composite score $X_c$ as a proportion of its maximum value. Then, apply a
variance-stabilizing transformation, to produce the transformed score

$$X'_c = 2 \arcsin\sqrt{X_c / X_{c\,\max}}$$

Assume that the conditional distribution of this transformed score, for
students with true composite score $T_c$, is normal, with mean

$$T'_c = 2 \arcsin\sqrt{T_c / X_{c\,\max}}$$



and variance $1/n_c$, where $n_c$ is the effective test length of the composite
score $X_c$. That is, $n_c$ is the length of an objective test having the same
reliability as the composite $X_c$. To complete the model, we need an estimate
of $n_c$.
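The transformed-normal model can be illustrated with a short Python sketch. The function names, the cut score, and the numbers in the usage note are hypothetical and are not taken from the paper. The sketch also ignores any continuity correction, which the paper does not discuss.

```python
import math

def arcsine_transform(score, max_score):
    """Variance-stabilizing transform: 2 * arcsin(sqrt(score / max_score))."""
    return 2.0 * math.asin(math.sqrt(score / max_score))

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_at_or_above(cut, true_score, max_score, n_eff):
    """P(observed composite >= cut | true composite score).

    Assumes the transformed observed score is normal with mean equal to
    the transformed true score and variance 1 / n_eff, where n_eff is
    the effective test length of the composite.
    """
    mean = arcsine_transform(true_score, max_score)
    sd = 1.0 / math.sqrt(n_eff)
    z = (arcsine_transform(cut, max_score) - mean) / sd
    return 1.0 - normal_cdf(z)
```

For instance, `prob_at_or_above(60, 70, 100, 40)` would give the modeled probability that a student whose true composite score is 70 (out of a maximum of 100) scores at or above a cut score of 60 when the effective test length is 40.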

Estimating effective test length

To estimate the effective test length of the composite, we must be 
willing to make an assumption that may actually be only approximately true. 
We must assume that the true scores $T_0$, $T_1$, $T_2$, ... are perfectly
intercorrelated. That is, we must assume that if we had perfectly reliable
measures of the skills measured by the objective portion and by each essay,
these measures would correlate 1.00 with each other.

It follows from this assumption that true scores on the composite will 
be perfectly correlated with true scores on the objective portion. In 
general, the standard deviation of true scores on a test is directly 
proportional to the length of the test. Therefore, we can reasonably define 
the effective test length of the composite score as the length of the 
objective portion n Q (which we know), scaled up by the ratio of the 
true-score standard deviations: 

$$n_c = n_0 \, [\, s(T_c) / s(T_0) \,]$$

We can estimate $s(T_0)$ by applying a conventional reliability formula (alpha,
split-halves, etc.), to produce the estimate

$$s(T_0) = s(X_0) \sqrt{r_0}$$

where $r_0$ is the reliability coefficient of the objective portion. If the
objective scores include a correction for guessing, $s(T_0)$ will be
artificially inflated. To correct for this effect, multiply $s(T_0)$ by
k/(k-1), where k is the number of answer options per item. In this case,
$s(X_0)$ must be computed without changing negative scores to zeroes.
At this point, the missing link in the model is an estimate of $s(T_c)$,
the standard deviation of true scores on the composite. 
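As a rough illustration of the two formulas in this section, the following Python sketch computes the true-score standard deviation of the objective portion and then the effective test length. The function names and the example numbers are hypothetical, the correction for guessing is left out, and the true-score standard deviation of the composite is treated as already known.

```python
import math

def true_score_sd(observed_sd, reliability):
    """s(T) = s(X) * sqrt(r), for a reliability coefficient r in [0, 1]."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return observed_sd * math.sqrt(reliability)

def effective_test_length(n_objective, sd_true_composite, sd_true_objective):
    """n_c = n_0 * s(T_c) / s(T_0)."""
    return n_objective * sd_true_composite / sd_true_objective

# Hypothetical example: a 50-item objective portion with s(X_0) = 8 and
# r_0 = 0.81, and a composite whose true-score SD is assumed to be 10.8.
sd_t0 = true_score_sd(8.0, 0.81)                  # about 7.2
n_c = effective_test_length(50, 10.8, sd_t0)      # about 75
```

Under these made-up values the composite would behave like an objective test of about 75 items.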

Estimating the true-score standard deviation of the composite

The problem of estimating $s(T_c)$ is the same as that of estimating the
reliability of $X_c$, since $s(X_c)$ can be observed and $s(T) = s(X)\sqrt{r}$. Kristof
(1974) and Gilmer and Feldt (1983) have proposed methods for solving this
problem. Kristof's method is simpler but requires that the composite be
defined as the sum of exactly three components. If the composite includes 
more than three components, it is possible to combine components to meet 
this requirement. If the composite includes only two components, it is 
necessary to divide one of them, presumably the objective portion, into two 
parts, creating a composite of three components. 

Kristof's formula, applied to a test consisting of an objective
component and two essay questions, leads to the estimate

$$s(T_c) = \frac{c_{01} c_{02} + c_{01} c_{12} + c_{02} c_{12}}{\sqrt{c_{01} c_{02} c_{12}}}$$

where $c_{01}$, $c_{02}$, and $c_{12}$ are the covariances of the weighted components,
i.e.,

$$c_{01} = w_0 w_1 \,\mathrm{Cov}(X_0, X_1);$$
$$c_{02} = w_0 w_2 \,\mathrm{Cov}(X_0, X_2);$$
$$c_{12} = w_1 w_2 \,\mathrm{Cov}(X_1, X_2).$$
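A minimal Python sketch of this estimate follows. The function name is hypothetical, and the positivity check is an added safeguard rather than part of the paper. The inputs are the three weighted covariances defined above.

```python
import math

def kristof_true_sd(c01, c02, c12):
    """Estimate s(T_c) for a three-component composite.

    c01, c02, c12 are the covariances of the weighted components.
    They must all be positive for the square root to be defined.
    """
    if min(c01, c02, c12) <= 0:
        raise ValueError("all three weighted covariances must be positive")
    numerator = c01 * c02 + c01 * c12 + c02 * c12
    return numerator / math.sqrt(c01 * c02 * c12)
```

As a check on the formula, suppose the weighted component true scores have standard deviations 2, 3, and 5 and are perfectly correlated. The weighted covariances are then 6, 10, and 15, and `kristof_true_sd(6, 10, 15)` returns 10, the sum of the three standard deviations.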

When the composite consists of exactly three components, Gilmer and 
Feldt's estimate is identical to Kristof's. When the composite includes
four or more components, Gilmer and Feldt's method is considerably more
complex than Kristof's, but also more accurate. Gilmer and Feldt (1983) 
actually developed two methods. Their method "F2" is simpler to apply than
their more complicated method "F1". Only method "F2" will be presented
here. 

Two modifications of Gilmer and Feldt's formulas are necessary. First, 
their development does not provide for weighting of the components. Second, 
their formulas provide a solution for the reliability coefficient, rather 
than the true-score standard deviation of the composite. With the necessary 
modifications, their method can be expressed as follows.

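Compute the covariance matrix of the weighted components $[c_{ij}]$, with
cell entries

$$c_{ij} = \mathrm{Cov}(w_i X_i, w_j X_j).$$

Let the subscript m indicate the row of this matrix for which the sum of the
off-diagonal entries is largest:

$$\sum_{j \ne i} c_{ij} \le \sum_{j \ne m} c_{mj}$$

for all $i \ne m$. Define

$$D_i = \frac{\sum_{j \ne i} c_{ij} - c_{im}}{\sum_{j \ne m} c_{mj} - c_{im}} .$$

Note that $D_m = 1$. The Gilmer-Feldt "F2" estimate of the variance of $T_c$ is

$$\mathrm{Var}(T_c) = \frac{\mathrm{Var}(X_c) - \sum_i c_{ii}}{1 - \dfrac{\sum_i D_i^2}{\left(\sum_i D_i\right)^2}} .$$

The square root of this quantity provides an estimate of $s(T_c)$.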
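The steps above can be sketched in Python as follows. This is an illustrative reading of the procedure, not Gilmer and Feldt's own code. The function name is hypothetical, and the covariance matrix of the weighted components is assumed to be supplied as a list of lists.

```python
def gilmer_feldt_f2_true_variance(cov):
    """Estimate Var(T_c) from the weighted covariance matrix [c_ij]."""
    k = len(cov)
    if k < 3:
        raise ValueError("at least three components are required")
    # Sum of the off-diagonal entries in each row.
    off_row = [sum(cov[i][j] for j in range(k) if j != i) for i in range(k)]
    # Row m has the largest off-diagonal sum.
    m = max(range(k), key=lambda i: off_row[i])
    # D_i compares row i with row m after removing their shared covariance.
    d = []
    for i in range(k):
        if i == m:
            d.append(1.0)
        else:
            d.append((off_row[i] - cov[i][m]) / (off_row[m] - cov[m][i]))
    sum_d = sum(d)
    sum_d_sq = sum(x * x for x in d)
    # Var(X_c) minus the diagonal equals the sum of all off-diagonal entries.
    off_total = sum(off_row)
    return off_total / (1.0 - sum_d_sq / sum_d ** 2)
```

With three components whose weighted covariances are 6, 10, and 15, this function returns 100, which is the square of the value that Kristof's formula gives for the same covariances. The diagonal entries do not affect the result, because only off-diagonal sums enter the computation.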

This estimate of the true-score variance of the composite is the piece
that completes the model. It leads to an estimate of the effective test
length of the composite. The estimate of effective test length gives us an
estimate of the variance in the transformed-normal model for the conditional
distribution of observed scores, given true score. We can put this model
for the conditional observed-score distribution together with a model for
the true-score distribution, to get a model for the joint distribution of
true scores and observed scores. With this model and the data from a
reasonably large sample of test-takers, we can estimate the joint T,X
distribution and summarize it any way we like.

If we want to estimate the joint distribution of observed scores on 
alternate forms, we can begin by dividing the true-score range into fairly 
small intervals. (If we have used a Subkoviak-type true-score model, we 
have already made this partition.) We can then assume conditional
independence of the two observed score variables within each true-score 
interval, and compute the joint X,X distribution for each true score 
interval. We can then weight each of these joint observed-score 
distributions by the estimated number of test-takers in the true-score 
interval and sum over the true-score intervals. The result will be an 
estimate of the joint distribution of observed scores on alternate forms, 
which we can summarize any way we like.
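The weighting and summing described in this paragraph might be sketched in Python as below. The data layout is an assumption made for illustration: `weights[t]` is the estimated proportion of test-takers in true-score interval t, and `cond[t][x]` is the modeled probability of observed score x for that interval. The agreement index at the end is only one of many possible summaries, and its name is hypothetical.

```python
def joint_alternate_forms(weights, cond):
    """Joint distribution of observed scores on two alternate forms.

    Within each true-score interval the two observed scores are treated
    as independent, so the conditional joint probability is a product.
    The intervals are then combined using their weights.
    """
    n_scores = len(cond[0])
    joint = [[0.0] * n_scores for _ in range(n_scores)]
    for w, row in zip(weights, cond):
        for x1, p1 in enumerate(row):
            for x2, p2 in enumerate(row):
                joint[x1][x2] += w * p1 * p2
    return joint

def classification_agreement(joint, cut):
    """Probability that both forms place a test-taker on the same side of cut."""
    return sum(
        p
        for x1, row in enumerate(joint)
        for x2, p in enumerate(row)
        if (x1 >= cut) == (x2 >= cut)
    )
```

In a toy case with two true-score intervals of equal weight, two possible observed scores, and conditional distributions (0.8, 0.2) and (0.1, 0.9), the agreement at a cut of 1 is 0.5(0.64 + 0.04) + 0.5(0.01 + 0.81) = 0.75.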
Appendix

Adaptation of Gilmer and Feldt's derivation of coefficient "F2" (Gilmer and
Feldt, 1983).

Let $X_c = \sum_i w_i X_i$, with all $w_i > 0$.

Then $T_c = \sum_i w_i T_i$.

Assume that each component true score $T_i$ is correlated +1.00 with the
composite true score $T_c$. Then for each $i$ there is a constant $a_i > 0$ and a
constant $b_i$ such that

$$T_i = a_i T_c + b_i .$$

Then

$$T_c = \sum_i w_i (a_i T_c + b_i) = T_c \sum_i w_i a_i + \sum_i w_i b_i .$$

And

$$\mathrm{Var}(T_c) = \Bigl(\sum_i w_i a_i\Bigr)^2 \mathrm{Var}(T_c)$$

because the $w_i$, $a_i$, and $b_i$ are constants. And since all the $w_i$ and $a_i$ are
positive,

$$\sum_i w_i a_i = 1 . \tag{1}$$

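Define

$$c_{ij} = \mathrm{Cov}(w_i X_i, w_j X_j).$$

Then, for $i \ne j$,

$$
\begin{aligned}
c_{ij} &= w_i w_j \,\mathrm{Cov}(X_i, X_j) \\
       &= w_i w_j \,\mathrm{Cov}(T_i, T_j)
          \quad \text{(because of the independence of errors of measurement)} \\
       &= w_i w_j \,\mathrm{Cov}(a_i T_c + b_i,\; a_j T_c + b_j) \\
       &= w_i w_j a_i a_j \,\mathrm{Var}(T_c).
\end{aligned}
\tag{2}
$$

Therefore,

$$\mathrm{Var}(X_c) = \sum_i c_{ii} + \sum_{i \ne j} c_{ij}
  = \sum_i c_{ii} + \mathrm{Var}(T_c) \sum_{i \ne j} w_i w_j a_i a_j ,$$

where the sums over $i \ne j$ run over all ordered pairs of distinct components.

Solving for $\mathrm{Var}(T_c)$,

$$\mathrm{Var}(T_c) = \frac{\mathrm{Var}(X_c) - \sum_i c_{ii}}{\sum_{i \ne j} w_i w_j a_i a_j} . \tag{3}$$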

To translate this expression into a usable estimate of $\mathrm{Var}(T_c)$, we need
to express the denominator in terms of observable quantities. Going back to
equation (1) and squaring both sides,

$$\Bigl(\sum_i w_i a_i\Bigr)^2 = 1$$

$$\sum_i w_i^2 a_i^2 + \sum_{i \ne j} w_i w_j a_i a_j = 1$$

$$\sum_{i \ne j} w_i w_j a_i a_j = 1 - \sum_i w_i^2 a_i^2 . \tag{4}$$

Substituting (4) into (3),

$$\mathrm{Var}(T_c) = \frac{\mathrm{Var}(X_c) - \sum_i c_{ii}}{1 - \sum_i w_i^2 a_i^2} . \tag{5}$$
Consider the sum of the off-diagonal elements in the ith row of the weighted
covariance matrix $[c_{ij}]$, minus the kth element:

$$\Bigl[\sum_{j \ne i} c_{ij}\Bigr] - c_{ik} .$$

By equation (2), this quantity equals

$$
\begin{aligned}
&\Bigl[\sum_{j \ne i} w_i w_j a_i a_j \,\mathrm{Var}(T_c)\Bigr] - w_i w_k a_i a_k \,\mathrm{Var}(T_c) \\
&\quad = w_i a_i \Bigl[\Bigl(\sum_{j \ne i} w_j a_j\Bigr) - w_k a_k\Bigr] \mathrm{Var}(T_c) \\
&\quad = w_i a_i \,[\,1 - w_i a_i - w_k a_k\,]\, \mathrm{Var}(T_c)
\end{aligned}
\tag{6}
$$

(because, by equation (1), $\sum_i w_i a_i = 1$).

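Let row m be the row of the weighted covariance matrix $[c_{ij}]$ having the
largest sum of weighted covariances. Define the index

$$D_i = \frac{\bigl[\sum_{j \ne i} c_{ij}\bigr] - c_{im}}{\bigl[\sum_{j \ne m} c_{mj}\bigr] - c_{mi}} \tag{7}$$

for the ith row of the matrix. Notice that $D_i$ is defined entirely in terms
of observable quantities and that $D_m = 1$. From equation (6),

$$D_i = \frac{w_i a_i \,[\,1 - w_i a_i - w_m a_m\,]\, \mathrm{Var}(T_c)}{w_m a_m \,[\,1 - w_m a_m - w_i a_i\,]\, \mathrm{Var}(T_c)}
      = \frac{w_i a_i}{w_m a_m} . \tag{8}$$

Therefore,

$$\sum_i D_i = \sum_i \frac{w_i a_i}{w_m a_m} = \frac{1}{w_m a_m} \sum_i w_i a_i = \frac{1}{w_m a_m} .$$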
Therefore,

$$w_m a_m = \frac{1}{\sum_i D_i} . \tag{9}$$

Also, from equation (8),

$$\sum_i D_i^2 = \sum_i \Bigl(\frac{w_i a_i}{w_m a_m}\Bigr)^2 = \frac{1}{(w_m a_m)^2} \sum_i w_i^2 a_i^2 .$$

Therefore,

$$\sum_i w_i^2 a_i^2 = (w_m a_m)^2 \sum_i D_i^2 . \tag{10}$$

Substituting (9) into (10),

$$\sum_i w_i^2 a_i^2 = \frac{\sum_i D_i^2}{\bigl(\sum_i D_i\bigr)^2} .$$

Substituting (10), in this form, into (5),

$$\mathrm{Var}(T_c) = \frac{\mathrm{Var}(X_c) - \sum_i c_{ii}}{1 - \dfrac{\sum_i D_i^2}{\bigl(\sum_i D_i\bigr)^2}}$$

where $c_{ij} = \mathrm{Cov}(w_i X_i, w_j X_j)$ and $D_i$ is as defined in equation (7).






